{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```{warning}\n",
    "This chapter is under construction.\n",
    "```\n",
    "\n",
    "# Coding for Economists Quickstart\n",
    "\n",
    "There's a trade-off in spending time wading through a book, potentially improving your human capital for the future, when you could be doing something that gives you more immediate utility. So I've created this quickstart tutorial to give you a taste of coding by covering a mini-project from end-to-end. It should take you no more than an hour to run through, and **you can follow it by loading this page interactively in Google Colab by clicking [here]** without installing anything.\n",
    "\n",
    "We'll use a range of techniques that you'll see explained in more detail in the rest of the book, including:\n",
    "\n",
    "- some basic [coding](#coding)\n",
    "- an example project that will involve:\n",
    "    - [reading, exploring, and cleaning data](#reading-in-and-clean-data)\n",
    "    - performing [analysis](#analysis) on the cleaned data\n",
    "\n",
    "Along the way, we'll see how to import extra software packages, how to make plots, and more.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Coding\n",
    "\n",
    "A typical segment of computer code might look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "variable = 10\n",
    "print(variable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we created an object named `variable`, assigned a value to it (10), and then printed it to screen. An object is just a container for something--it could be a number, a phrase, a function (that takes inputs and creates outputs), a list of other objects. Instead of doing operations on a number directly, `print(10+5)`, objects allow us to perform operations on containers that could have any number in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "α = 5\n",
    "variable = variable + α\n",
    "print(variable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The strength of this is that we can now perform much more impressive operations with a few lines of code. Say we wanted to add 5 to a list of numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list_of_numbers = [10, 20, 30, 40]\n",
    "new_list_of_numbers = [x + 5 for x in list_of_numbers]\n",
    "print(new_list_of_numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In high-level open source languages like Python and R, *everything is an object, and every object has a type*. You've already seen two types: integers, like `10` and lists, like `list_of_numbers`. You can always check a type like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "welcome_msg = \"Hello World!\"\n",
    "type(welcome_msg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Almost all programming languages come with the built-in types integers, floats (real numbers), strings (as above). In Python, there are also lists, dictionaries, sets, tuples, iterators, characters, functions, and so on. The extra packages that you can install to extend the functionality of the base language can add new types, and you can define your own types too. Types work sensibly together by default, for instance you can combine an integer and a float via addition, subtraction, multiplication, etc., and get a sensible answer, and adding two strings together concatenates them.\n",
    "\n",
    "### Functions\n",
    "\n",
    "It's best practice to never repeat yourself in code, the so-called DRY (do not repeat yourself) principle. Functions are the workhorse of not repeating yourself because they can be re-used.\n",
    "\n",
    "In the example below, we'll define a function that adds a number to every element in a list. The text at the top of the function is called a docstring. It gives information on what the function does.\n",
    "\n",
    "\n",
    "By default, the number it adds is 5. But we'll also define the number to add as a *keyword argument*, which means that we can override the default by supplying whatever number we like. The example below also makes use of `range(n)`, which creates a range of numbers from 0 to n-1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_number_to_list(input_list, number=5):\n",
    "    \"\"\"Function that adds a number to each element of a given list.\n",
    "    Default behaviour is to add 5 to the list.\n",
    "    \"\"\"\n",
    "    return [x + number for x in input_list]\n",
    "\n",
    "\n",
    "list_of_numbers = list(range(10))\n",
    "\n",
    "# Use default\n",
    "print(add_number_to_list(list_of_numbers))\n",
    "# Override default\n",
    "print(add_number_to_list(list_of_numbers, number=10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The beauty of this approach is that the code can be re-used in different situations. This is a very simple example--in reality, there are more concise ways to do this--but it conveys the idea of re-using code rather than repeating it.\n",
    "\n",
    "Note that the body of the function was *indented*. To mark the difference between the body of a function, for loop, or conditional clause, four spaces are used to indent each level. \n",
    "\n",
    "If you're ever unsure what a function does, just call `help(functionname)` on it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(add_number_to_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installing and using packages (possibly remove)\n",
    "\n",
    "As well as not repeating yourself by using functions, it's a good idea to make use of other people's hard work whenever you can. That means importing their code and using it. There's an amazing open source software community producing code that makes a host of operations far easier.\n",
    "\n",
    "To install extra packages, normally you'd open up a command line (also known as the terminal) and write `pip install packagename` to install a package into whatever coding environment you're currently using. If you're using Google Colab to go through this quickstart interactively, just uncomment the first line of the example below by removing the `#` character and the space following it so that the first line begins with an exclamation mark and the pip install command.\n",
    "\n",
    "Below is a simple example of using an installed package, in this case a progress bar called `tdqm`. `sleep`, from the **time** package, is installed by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install tqdm\n",
    "from tqdm import tqdm\n",
    "from time import sleep\n",
    "\n",
    "text = \"\\n Text: \"\n",
    "for bit_of_text in tqdm([\"This\", \"is\", \"a\", \"string\"]):\n",
    "    sleep(0.25)\n",
    "    text = text + \" \" + bit_of_text\n",
    "\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary\n",
    "\n",
    "There's a lot to think about when it comes to writing (good) code but, as we haven't got long, here are just a few key points:\n",
    "- beautiful, explicit, readable code is much better than complex code--the person who will read your code the most is future you! For example, use meaningful and informative variable names and comments.\n",
    "- Do not repeat yourself! If you find that you are writing the same thing over and over, use a function.\n",
    "- No-one remembers everything in a programming language. Use Stack Overflow (a forum), [cheat sheets](https://gto76.github.io/python-cheatsheet/), and the documentation websites of packages liberally. \n",
    "- Make use of other people's packages to avoid re-inventing the wheel, especially if they are widely used in the community.\n",
    "- We didn't cover it here, but eventually you'll want to use version control to track and save the changes you make in your code. This stops you from having files named 'final_definitelyfinal_v10.py' and instead gives you a full history of every change you've made for every single file.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mini-project\n",
    "\n",
    "In this mini-project we will aim to explain how *interest rates affect the selling price of houses*. **Big caveat**: this is *not* supposed to be a robust piece of analysis or a full-fledged research project; it's just a demonstration of some techniques and how to weave them together. If you were actually thinking about this problem there's a lot more that you'd want to take into account. You should not take the results of this analysis too seriously!\n",
    "\n",
    "In our exploration of this problem, we'll use house prices as sold from Ames, Iowa. This is mostly for convenience because there's a [Kaggle]() competition that uses this data, so it's publicly available (we'll be using the *train* version of the dataset). But it's worth noting that house prices as sold are different from house prices as advertised--they are prices that someone was willing to pay! They are at the equilibrium point in the supply-demand curve.\n",
    "\n",
    "### Housing\n",
    "\n",
    "Let's think for a moment about how we're going to tackle this problem.\n",
    "\n",
    "First, not all houses are created equal. They are not a commodity; they're hugely heterogeneous. We might not expect the construction of a 15 bed mansion with a pool and underground cinema to affect the supply or demand for a 1 bed studio apartment (though it might say something about general economic conditions!). So we must also control for the many effects that might influence the price of a house, such as the number of bedrooms, whether or not it is in good shape, and whether or not it has an underground cinema.\n",
    "\n",
    "Against type, we've considered heterogeneity before supply and demand. We should of course also expect the price of houses to vary in accordance with the supply and demand of housing, which means we must take into account any factors that change supply and demand.\n",
    "\n",
    "Finally, it's also useful to consider housing as an asset. This [paper] shows that it had one of the best rates of return over 19xx-19xx. Housing can yield rent for landlords or, implicily, a *user cost* for homeowners--that is a rent you 'pay' to yourself for living in the house you own. If housing is an asset, it's going to be affected by the price of other assets. We should therefore track the rate of return on other asstes.\n",
    "\n",
    "#### Demand\n",
    "\n",
    "The demand for housing is going to depend on the number of people, their propensity to live in close quarters with one another, how well-off people are (or expect to be), \n",
    "\n",
    "Costs incurred to maintain the property, including depreciation, maintenance, and insurance.\n",
    "\n",
    "#### Supply\n",
    "\n",
    "The most obvious factor is the supply of housing, i.e. are lots of new homes being built? We should put in a factor that captures this. We should also consider how well occupied *existing* housing is, so we'll incorporate a measure of how many vacant units there are.\n",
    "\n",
    "https://fred.stlouisfed.org/series/IAHVAC - housing vacancy\n",
    "\n",
    "https://fred.stlouisfed.org/series/IARVAC - rental vacancy\n",
    "\n",
    "There's another channel for supply to change. Over a long period of time, an unanticipated fall in real yields on other assets might make housing relatively more attractice, pushing up house prices. Eventually, we'd expect this to generate extra housing supply. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Clearly, we'll be taking interest rates into account, because that's the relationship we're interested in. But we might also need to think about whether people have access to credit in order to get a mortgage, how well people are doing in general (and, perhaps, how well they expect to do in the future), and whether lots of new houses are being built.\n",
    "\n",
    "https://bankunderground.co.uk/2020/06/03/theres-more-to-house-prices-than-interest-rates/\n",
    "\n",
    "https://www.bankofengland.co.uk/working-paper/2019/uk-house-prices-and-three-decades-of-decline-in-the-risk-free-real-interest-rate\n",
    "\n",
    "Growth in house prices due to rising aggregate real income\n",
    "\n",
    "https://academic.oup.com/qje/article-abstract/99/4/729/1896452?redirectedFrom=fulltext\n",
    "\n",
    "https://fred.stlouisfed.org/series/HOUST\n",
    "\n",
    "https://fred.stlouisfed.org/series/DFII10 index-linked treasury\n",
    "\n",
    "Need something on credit conditions.\n",
    "\n",
    "## Read in, explore, and clean data\n",
    "\n",
    "### Reading in data\n",
    "\n",
    "For any and all empirical work, loading up the data you're using is going to be one of the first things you need to do. Fortunately, there are powerful open source solutions to load data in almost any format you can think of. And, in my experience, almost every data cleaning operation is different so bear in mind that in this example you'll just see *one* way that data cleaning might occur.\n",
    "\n",
    "We're going to use the ubiquitous data package [**pandas**](https://pandas.pydata.org/), which has ways to load data from formats including:\n",
    "- csv\n",
    "- Excel\n",
    "- txt\n",
    "- Stata\n",
    "- parquet (a fast, big data format)\n",
    "- json\n",
    "- the clipboard (yes, as in `Ctrl + C`!)\n",
    "- SAS\n",
    "- SPSS\n",
    "- SQL\n",
    "- Feather\n",
    "- HDF5\n",
    "\n",
    "The standard way to call this library in is via `import pandas as pd`. You can give whatever name you like to packages you import (for example, you could `import pandas as supercalifragilisticexpialidocious`), but there are a few conventions around for the very popular packages like **pandas** and shorter import names are going to save you time and effort.\n",
    "\n",
    "It's always easier to read in neatly formatted data but so often we find that real-world data is messy and needs a bit of work to get into shape. To demonstrate how this works in practice, we're going to work with a messy dataset and show how to read it in. You'll have probably heard that 80% of data science is cleaning the data--well it's true for empirical economics too!\n",
    "\n",
    "#### The problem\n",
    "\n",
    "The dataset we'll use is the Ames, Iowa house price data. The objective is to fit a model of house prices using regression. These data come from a Kaggle competition.\n",
    "\n",
    "Bring in housing deflator to adjust prices.\n",
    "\n",
    "The data come in a csv file, so we'll be using pandas' `read_csv` function. Then we'll take a first look at the first few rows of the data using the `head` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "# (Change to link)\n",
    "df = pd.read_csv(os.path.join(\"data\", \"ames_iowa_house_prices.csv\"))\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploring data\n",
    "\n",
    "\n",
    "What do we know about these data? We can use the `info` function to get a very high-level overview of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us some basic information on quantitative columns, such as their mean and standard deviation. We can also see that there are many (60) columns and 39,644 rows.\n",
    "\n",
    "Really, we want a bit more information than this. Happily, there are more powerful tools we can bring to bear for exploratory data analysis; we'll use the [**pandas profiling**](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) package. (Again, you may need to install this first by uncommenting the first line)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install pandas-profiling\n",
    "from pandas_profiling import ProfileReport"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile = ProfileReport(df, title=\"Profiling Report\", minimal=True)\n",
    "profile.to_notebook_iframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a full on report about everything in our dataset! We can see, for instance, that we have 32 numerical variables, 1 boolean (1 or 0; true or false), and 48 categorical variables. \n",
    "\n",
    "The warnings page has plenty of useful info on data quality. `Alley` is missing most of its values, for example, `PoolArea` has a highly skewed distribution, and `EnclosedPorch` is mostly filled with zeros. We can also see that, usefully, the `Id` column is unique.\n",
    "\n",
    "Digging down into the detailed reports on each variable and toggling details on, we can see that, for example, `MSZoning` is a categorical variable with highly unbalanced classes.\n",
    "\n",
    "The absolute first and most useful thing you can do with a new dataset is to get to know it, warts and all. Here, we used the most basic pandas profile report (`minimal=True`), but you can opt for ones that include more analysis, although be wary with larger datasets as some extras do not scale well.\n",
    "\n",
    "Let's take a closer look at the variable we're trying to explain, which is `SalePrice`. For quick exploration, we'll use two of the most popular plotting libraries in Python, the imperative (build what you want) library **matplotlib**, which has very good integration with **pandas**, and the declarative (say what you want) library **seaborn**. We'll take a look at `SalePrice` and see if it's roughly log-normal ready for our regression later. We'll also modify the **matplotlib** default plotting style to be a bit more easy on the eye! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import scipy.stats as st\n",
    "import numpy as np\n",
    "\n",
    "plt.style.use(\"seaborn-notebook\")\n",
    "sns.distplot(df[\"SalePrice\"], bins=100, kde=False, fit=st.lognorm);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although there are *already* lots of variables in this dataset, we must recognise that what we want to explain comes from a range of times. In fact, we should look at the time range and sale price to see if there's anything to cause concern there, for instance time trends.\n",
    "\n",
    "First, let's create a proper datetime variable from the given `YrSold` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"date\"] = pd.to_datetime(df[\"YrSold\"], format=\"%Y\")\n",
    "df[\"date\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are houses that were sold within the given year, but the default date setting has shifted the datetimes to the start of the year. Let's sort that out by shifting the datetimes to the end of the year, to prevent any information problems later in the analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"date\"] = df[\"date\"] + pd.offsets.YearEnd()\n",
    "df[\"date\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's look at the time trend in sales price:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.lineplot(data=df, x=\"date\", y=\"SalePrice\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like prices fell over the period, which encompasses the Great Financial Crisis. We've no real reason to believe that falling prices were anything to do with local factors to do with buildings or garages in Ames, Iowa so this suggest we might need to bring some broader factors into play in order to explain house prices. The first is to adjust for inflation so that we are dealing with sold prices in *real terms*. Essentially, we want to remove the general effect of rising prices due to inflation.\n",
    "\n",
    "The wonderful data website [FRED]() has a time series which is a house price index that we can use as a deflator, namely the 'All-Transactions House Price Index for the United States', codename USSTHPI. But we need to get that data into the rest of our analysis. Fortunately, there's a package called **pandas-datareader** that exists to connect your analysis with online databases that we can use to pull down the relevant time series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas_datareader as pdr\n",
    "\n",
    "start = df[\"date\"].min() - pd.offsets.DateOffset(years=1)\n",
    "end = df[\"date\"].max() + pd.offsets.DateOffset(years=1)\n",
    "hpi_deflator = pdr.get_data_fred(\"USSTHPI\", start, end)\n",
    "hpi_deflator.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hpi_deflator.plot();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, now we need to fold this into our dataset with a merge. Let's use the date column to do this. We can see that the HPI is at a higher frequency than our dataset, so our first task is to down-sample it to annual frequency, by taking the mean, then we'll make sure it has a column with the same name for us to merge on:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_hpi = (\n",
    "    hpi_deflator.groupby(pd.Grouper(freq=\"A\"))\n",
    "    .mean()\n",
    "    .reset_index()\n",
    "    .rename(columns=str.lower)\n",
    ")\n",
    "ds_hpi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now merge these two together while keeping only those entries in `df` that are also in `ds_pi`. Effectively, we are just tacking on the HPI to the existing dataframe. We'll show this by printing the first 5 rows and last 4 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.merge(df, ds_hpi, on=\"date\", how=\"left\")\n",
    "df.iloc[:5, -4:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we have added the data and HPI, we can adjust the sale price and take a look at it as a real value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"log_real_price\"] = np.log(df[\"SalePrice\"] / df[\"ussthpi\"])\n",
    "sns.distplot(df[\"log_real_price\"], bins=100, kde=True, fit=st.norm)\n",
    "# Get the fitted parameters used by the function\n",
    "(mu, sigma) = st.norm.fit(df[\"log_real_price\"])\n",
    "plt.legend(\n",
    "    [\"KDE\", f\"Normal dist. ($\\mu=$ {mu:.2f} and $\\sigma=$ {sigma:.2f} )\"],\n",
    "    loc=\"upper left\",\n",
    "    frameon=False,\n",
    ")\n",
    "fig = plt.figure()\n",
    "res = st.probplot(df[\"log_real_price\"], plot=plt)\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exclude_vars = [\"date\", \"ussthpi\", \"Id\", \"SalePrice\"]\n",
    "# Do this as a column of correlations?\n",
    "sns.heatmap(\n",
    "    df[[x for x in df.columns if x not in exclude_vars]].corr(), vmin=-1, vmax=1\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a first problem: there is whitespace before some column names. That might make it easy to make an error later on, for example if we look at a column we might type `df['column_name']` instead of `df[' column_name']`. So, as a first step in cleaning this dataset, let's eliminate any leading or trailing whitespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.rename(columns=dict(zip(df.columns, [x.lstrip().rstrip() for x in df.columns])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a lot to unpack above. The rename function accepts a dictionary, a Python data type, which maps one set of variables into another. An example would be `{' columnname': 'columname'}`, with the mapping going from before the `:` to after. `zip` is a command that pairs every element of a first list with every element of a second list. Finally, the second list that is passed to `zip` consists of the original column names but with leading (`lstrip`) and trailing (`rstrip`) whitespace removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Variable values\n",
    "\n",
    "We should take special care with the values of the variables we're interested in: shares, and num\\_imgs and num\\_videos. Looking at those three in the profiling section, we can see that shares is heavily skewed, with most of the mass around (but greater than zero) and a very small number of larger values. This suggests taking the log of this variable. Secondly, we see the num\\_videos parameter has a 95th percentile of 6, a very small number. The distribution is very skewed, with most values being zero. This suggests whether an article contains a video or not might be a better measure of 'videos' than the number of them. Finally, num\\_imgs has quite a skewed distribution too, but it's not quite as extreme or centred on zero--so we'll leave this variable as it is.\n",
    "\n",
    "Let's put those changes through:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "df[\"log_shares\"] = np.log(df[\"shares\"])\n",
    "df[\"video\"] = df[\"num_videos\"] > 0\n",
    "df[\"video\"] = df[\"video\"].astype(\"category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary\n",
    "\n",
    "Reading in, exploring, and cleaning data are usually the hardest parts of (economic) data science. Data can catch you out so easily, and in so many different ways. The example above is very bespoke to the data, and this is typical of all data cleaning and preparation exercises. We didn't even see common operations like merging two different datasets! The best advice I can give is to start experimenting with data cleaning in order to come across some common themes.\n",
    "\n",
    "Remember:\n",
    "- understanding your data is most of the battle--running a model on cleaned data is the easy part\n",
    "- how you read, explore, and clean your data will depend entirely on the question you are trying to answer\n",
    "- \n",
    "\n",
    "## Analysis\n",
    "\n",
    "We will now try to explain the number of shares ('shares' in the dataset) of an article based on characteristics of the article in the dataset. Specifically, we are interested in whether having rich media, such as images and video, helps increases the shares of articles. We can do that by using ordinary least squares (OLS) to regress shares on the variables representing the amount of rich media content. \n",
    "\n",
    "Let's start with the simplest model we can think of, which is just regressing the log(shares) on the fixed effects from the weekday and data channel as well as the number of images, number of videos, and number of links to other articles. We'll use the [**statsmodels**] package and its formula API. This lets us use text to specify a model we want to estimate. Putting 'C(variable_name)' into the formula tells statsmodels to treat that variable as a categorical variable, also known as a fixed effect.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.formula.api as smf\n",
    "\n",
    "model_basic = smf.ols(\n",
    "    \"log_shares ~ C(data_channel)  + C(wkday) + num_imgs + C(video) + num_hrefs + C(quarter)\",\n",
    "    data=df,\n",
    ")\n",
    "results_basic = model_basic.fit(cov_type=\"HC1\")\n",
    "print(results_basic.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So it looks like a unit increase in the number of images is associated with a 0.46% increase in the number of shares.\n",
    "\n",
    "However, there are a LOT of other variables in this dataset that we haven't used. Omitting them could be influencing the parameters we're seeing. So actually the first thing we should be doing is considering whether we need to include these other variables. As many of them could also have an influence on shares, we probably should--but there are just so many!\n",
    "\n",
    "The easiest way to thing about them is to break them down into similar groups of variables. There are some that count tokens (eg individual words), some looking at sentiment and polarity, and some looking at the title of the article. Then there are a few miscellaneous ones left over (such as url, which we can safely not use in the regression). Let's try and group these use Python's list comprehensions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "token_vars = [x for x in df.columns if \"token\" in x]\n",
    "sentiment_vars = [\n",
    "    x\n",
    "    for x in df.columns\n",
    "    if (((\"sentiment\" in x) or (\"polarity\" in x)) and (\"title\" not in x))\n",
    "]\n",
    "keyword_vars = [x for x in df.columns if \"kw\" in x]\n",
    "title_vars = [x for x in df.columns if ((\"title\" in x) and (x not in token_vars))]\n",
    "# Let's look at one of these as an example:\n",
    "print(\", \\t\".join(title_vars))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, there are now four distinct groups of variables in addition to the ones that were already considered.\n",
    "\n",
    "We *could* just throw everything into a model (the kitchen sink approach) but some of the variables in the data are likely to be very highly correlated, and multi-collinearity will create issues for our regression. Let's first look at whether any of the variables we haven't already discussed are highly correlated. Just taking the correlation of *all* of the variables will create a huge matrix, so we'll also cut it down to pairs that are highly correlated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_mat = df[token_vars + title_vars + sentiment_vars + keyword_vars].corr()\n",
    "# Grab any pairs that are problematic\n",
    "corr_mat[((corr_mat > 0.7) | (corr_mat < -0.7)) & (corr_mat != 1)].dropna(\n",
    "    how=\"all\", axis=0\n",
    ").dropna(how=\"all\", axis=1).fillna(\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's clear from this there are quite a few pairs of correlated variables within each group of variables, and we should be cautious about lumping them all in together.\n",
    "\n",
    "We have many choices at this point but, without going into too much detail, we could either remove some of the independent variables or combine them. In this case, we think that there could still be useful information in the variables so we'd like to keep them but whittle down their information to fewer variables. This sounds like a job for unsupervised machine learning!\n",
    "\n",
    "### Dimensional reduction\n",
    "\n",
    "We will make use of the UMAP algorithm to take the sets of variables we've identified, which consists of 25 variables in total, and squish them down to just four dimensions (variables), on the basis that we there are probably only really four different bits of information here.\n",
    "\n",
    "We'll also make use of a scaler, an algorithm that puts the different data on the same scale. This helps the UMAP algorithm more effectively perform dimensional reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us a slightly different value for the impact of the number of videos on the percentage of shares of an article: 0.38%, versus 0.46% from earlier. How can we square these? They did use slightly different specifications. In fact, there were many choices of specification we could have made throughout this process. This garden of forking paths is a problem if we want to have confidence in the relationship that we're interested in; the results should not be fragile to small changes in specification.\n",
    "\n",
    "Fortunately, there are ways to think about this more comprehensively. One trick is to use *specification curve analysis*. This looks at a range of plausible specifications and plots them out. By comparing so many specifications, we get a better idea of whether the preferred specification is a fragile outlier or a robust results.\n",
    "\n",
    "We'll create a specification curve for the association between the number of images and the number of shares using the [**specification_curve**](https://specification-curve.readthedocs.io/en/latest/readme.html) package (disclaimer: I wrote this package!)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from specification_curve import specification_curve as specy\n",
    "\n",
    "sc = specy.SpecificationCurve(\n",
    "    df,\n",
    "    \"log_shares\",\n",
    "    \"num_imgs\",\n",
    "    [\n",
    "        \"umap_0\",\n",
    "        \"umap_1\",\n",
    "        \"umap_2\",\n",
    "        \"umap_3\",\n",
    "        \"num_hrefs\",\n",
    "        \"wkday\",\n",
    "        \"data_channel\",\n",
    "        \"quarter\",\n",
    "        \"video\",\n",
    "    ],\n",
    "    always_include=[\"video\", \"num_hrefs\"],\n",
    ")\n",
    "sc.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.plot(\n",
    "    preferred_spec=[\n",
    "        \"log_shares\",\n",
    "        \"num_imgs\",\n",
    "        \"umap_0\",\n",
    "        \"umap_1\",\n",
    "        \"umap_2\",\n",
    "        \"umap_3\",\n",
    "        \"num_hrefs\",\n",
    "        \"wkday\",\n",
    "        \"data_channel\",\n",
    "        \"quarter\",\n",
    "        \"video\",\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the specification curve, we can see that most estimates are clustered around the 0.35%--0.50% range *if* the number of links and video fixed effect are both included as regressors. These are both similar to the variable we're interested in, so it seems reasonable to always include them. The preferred specification is right at the lower end of the range, but includes all of the controls.\n",
    "\n",
    "### Summary\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Presenting results"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "codeforecon",
   "language": "python",
   "name": "codeforecon"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
